Computer program, pose derivation method, and pose derivation device

ABSTRACT

A method includes: obtaining a first 3D model point cloud associated with surface feature elements of a 3D model corresponding to a real object; obtaining a 3D surface point cloud from current depth image data of the real object; obtaining a second 3D model point cloud associated with 2D model points in a model contour; obtaining a 3D image contour point cloud at respective intersections of first imaginary lines and second imaginary lines; and deriving a second pose based at least on the first 3D model point cloud, the 3D surface point cloud, the second 3D model point cloud, the 3D image contour point cloud and the first pose.

BACKGROUND

1. Technical Field

This disclosure relates to the derivation of a pose of a real object.

2. Related Art

Paul J. Besl, Neil D. McKay, "A Method for Registration of 3-D Shapes," IEEE Transactions on Pattern Analysis and Machine Intelligence (United States), IEEE Computer Society, February 1992, Vol. 14, No. 2, pp. 239-256 discloses the ICP method. ICP is the abbreviation of iterative closest point. The ICP method refers to an algorithm used to minimize the difference between two point clouds (to match the two point clouds).

SUMMARY

An advantage of the disclosure is that a pose can be derived with higher accuracy than with the known ICP method.

The disclosure can be implemented as the following configurations.

An aspect of the disclosure is directed to a non-transitory storage medium containing program instructions that, when executed by a processor, cause the processor to perform a method including: obtaining a first 3D model point cloud associated with surface feature elements of a 3D model corresponding to a real object in a scene, the first 3D model point cloud being on the 3D model; obtaining a 3D surface point cloud from current depth image data of the real object captured with a depth image sensor; obtaining a second 3D model point cloud associated with 2D model points in a model contour that is obtained from projection of the 3D model onto an image plane using a first pose of the 3D model, the second 3D model point cloud being on the 3D model; obtaining a 3D image contour point cloud at respective intersections of first imaginary lines and second imaginary lines, the first imaginary lines passing through image points and the origin of a 3D coordinate system of an image sensor, the image points being obtained from current intensity image data of the real object captured with the image sensor and corresponding to the 2D model points included in the model contour, the second imaginary lines passing through the second 3D model point cloud and being perpendicular to the first imaginary lines; and deriving a second pose based at least on the first 3D model point cloud, the 3D surface point cloud, the second 3D model point cloud, the 3D image contour point cloud and the first pose. According to the aspect of the disclosure, the second pose is derived using the intensity image data in addition to the depth image data and the 3D model. Therefore, the second pose can be derived with high accuracy.

In the aspect of the disclosure, the first pose may be a pose of the real object in a frame before a current frame of the depth image data or the intensity image data. The second pose may be a pose of the real object in the current frame of the depth image data or the intensity image data. According to this configuration, since the future first pose is decided based on the second pose, the future first pose can be derived with high accuracy.

In the non-transitory storage medium, the first pose may be a pose obtained from the image sensor or another image sensor. According to this configuration, the first pose can be easily derived and the processing load is reduced.

The disclosure can be realized in various other configurations, for example, in the form of a pose derivation method or a device which realizes this method.

Another aspect of the disclosure is directed to a method for deriving a pose of a real object in a scene including steps of: obtaining a first 3D model point cloud associated with surface feature elements of a 3D model corresponding to a real object in a scene, the first 3D model point cloud being on the 3D model; obtaining a 3D surface point cloud from current depth image data of the real object captured with a depth image sensor; obtaining a second 3D model point cloud associated with 2D model points in a model contour that is obtained from projection of the 3D model onto an image plane using a first pose of the 3D model, the second 3D model point cloud being on the 3D model; obtaining a 3D image contour point cloud at respective intersections of first imaginary lines and second imaginary lines, the first imaginary lines passing through image points and the origin of a 3D coordinate system of an image sensor, the image points being obtained from current intensity image data of the real object captured with the image sensor and corresponding to the 2D model points included in the model contour, the second imaginary lines passing through the second 3D model point cloud and being perpendicular to the first imaginary lines; and deriving a second pose based at least on the first 3D model point cloud, the 3D surface point cloud, the second 3D model point cloud, the 3D image contour point cloud and the first pose.

Still another aspect of the disclosure is directed to a pose derivation device including: a function of obtaining a first 3D model point cloud associated with surface feature elements of a 3D model corresponding to a real object in a scene, the first 3D model point cloud being on the 3D model; a function of obtaining a 3D surface point cloud from current depth image data of the real object captured with a depth image sensor; a function of obtaining a second 3D model point cloud associated with 2D model points in a model contour that is obtained from projection of the 3D model onto an image plane using a first pose of the 3D model, the second 3D model point cloud being on the 3D model; a function of obtaining a 3D image contour point cloud at respective intersections of first imaginary lines and second imaginary lines, the first imaginary lines passing through image points and the origin of a 3D coordinate system of an image sensor, the image points being obtained from current intensity image data of the real object captured with the image sensor and corresponding to the 2D model points included in the model contour, the second imaginary lines passing through the second 3D model point cloud and being perpendicular to the first imaginary lines; and a function of deriving a second pose based at least on the first 3D model point cloud, the 3D surface point cloud, the second 3D model point cloud, the 3D image contour point cloud and the first pose.

BRIEF DESCRIPTION OF THE DRAWINGS

The disclosure will be described with reference to the accompanying drawings, wherein like numbers reference like elements.

FIG. 1 shows the schematic configuration of an HMD.

FIG. 2 is a functional block diagram of the HMD.

FIG. 3 is a flowchart showing pose derivation processing.

FIG. 4 shows a neighbor discovery range.

FIG. 5 is a flowchart showing a contour feature (CF) method.

FIG. 6 shows the way a 3D image CF point is obtained, based on a 3D model CF point.

FIG. 7 shows an example of similarity score calculation.

FIG. 8 shows an example of similarity score calculation.

FIG. 9 shows an example of similarity score calculation.

FIG. 10 shows an example of similarity score calculation.

FIG. 11 shows an example of similarity score calculation.

FIG. 12 shows that a 2D model point is matched with multiple image points.

FIG. 13 shows an example in which 2D model points are matched with wrong image points.

DESCRIPTION OF EXEMPLARY EMBODIMENTS

FIG. 1 shows the schematic configuration of an HMD 100. The HMD 100 is a head-mounted display. The HMD 100 is an optical transmitting-type device. That is, the HMD 100 can allow the user to perceive a virtual image and at the same time directly visually recognize the background. The HMD 100 functions as a device which derives the pose of a real object, as described later. That is, the HMD 100 executes a method for deriving the pose of a real object.

The HMD 100 has an attachment strap 90 which can be attached to the head of the user, a display section 20 which displays an image, and a control section 10 which controls the display section 20. The display section 20 allows the user to perceive a virtual image in the state where the HMD 100 is mounted on the head of the user. The display section 20 allowing the user to perceive a virtual image is also referred to as "displaying AR". The virtual image perceived by the user is also referred to as an AR image.

The attachment strap 90 includes a wearing base section 91 made of resin, a cloth belt section 92 connected to the wearing base section 91, a camera 60, an inertial sensor 71, and a depth image sensor 80. The wearing base section 91 is curved to follow the shape of the human forehead. The belt section 92 is attached around the head of the user.

The camera 60 is an RGB sensor, that is, an image sensor. The camera 60 can capture an image of the background (scene) and is arranged at a center part of the wearing base section 91. In other words, the camera 60 is arranged at a position corresponding to the middle of the forehead of the user in the state where the attachment strap 90 is attached to the head of the user. Therefore, in the state where the user wears the attachment strap 90 on his/her head, the camera 60 captures an image of the background, which is the scenery of the outside in the direction of the user's line of sight, and acquires intensity image data as a captured image.

The camera 60 includes a camera base 61 which rotates relative to the wearing base section 91, and a lens part 62 fixed in relative position to the camera base 61. The camera base 61 is arranged in such a way as to be able to rotate along an arrow CS1, which indicates a predetermined range of an axis included in the plane including the center axis of the user when the attachment strap 90 is attached to the head of the user. Therefore, the optical axis of the lens part 62, which is the optical axis of the camera 60, is changeable in direction within the range of the arrow CS1. The lens part 62 captures a range which changes according to zooming in or out about the optical axis.

The depth image sensor 80 is also referred to as a depth sensor or distance image sensor. The depth image sensor 80 acquires depth image data.

The inertial sensor 71 is a sensor which detects acceleration, and is hereinafter referred to as an IMU (inertial measurement unit) 71. The IMU 71 can detect angular velocity and geomagnetism in addition to acceleration. The IMU 71 is arranged inside the wearing base section 91. Therefore, the IMU 71 detects the acceleration, angular velocity and geomagnetism of the attachment strap 90 and the camera base 61.

Since the IMU 71 is fixed in relative position to the wearing base section 91, the camera 60 is movable with respect to the IMU 71. Also, since the display section 20 is fixed in relative position to the wearing base section 91, the camera 60 is movable in relative position to the display section 20.

The display section 20 is connected to the wearing base section 91 of the attachment strap 90. The display section 20 is in the shape of eyeglasses. The display section 20 includes a right holding section 21, a right display drive section 22, a left holding section 23, a left display drive section 24, a right optical image display section 26, and a left optical image display section 28.

The right optical image display section 26 and the left optical image display section 28 are situated in front of the right and left eyes of the user, respectively, when the user wears the display section 20. One end of the right optical image display section 26 and one end of the left optical image display section 28 are connected together at a position corresponding to the glabella of the user when the user wears the display section 20.

The right holding section 21 has a shape extending substantially in a horizontal direction from an end part ER, which is the other end of the right optical image display section 26, and tilted obliquely upward from a halfway part. The right holding section 21 connects the end part ER with a coupling section 93 on the right-hand side of the wearing base section 91.

Similarly, the left holding section 23 has a shape extending substantially in a horizontal direction from an end part EL, which is the other end of the left optical image display section 28, and tilted obliquely upward from a halfway part. The left holding section 23 connects the end part EL with a coupling section (not illustrated) on the left-hand side of the wearing base section 91.

As the right holding section 21 and the left holding section 23 are connected to the wearing base section 91 via the right and left coupling sections 93, the right optical image display section 26 and the left optical image display section 28 are situated in front of the eyes of the user. The respective coupling sections 93 connect the right holding section 21 and the left holding section 23 in such a way that these holding sections can rotate and can be fixed at arbitrary rotating positions. As a result, the display section 20 is provided rotatably to the wearing base section 91.

The right holding section 21 is a member extending from the end part ER, which is the other end of the right optical image display section 26, to a position corresponding to the temporal region of the user when the user wears the display section 20.

Similarly, the left holding section 23 is a member extending from the end part EL, which is the other end of the left optical image display section 28, to a position corresponding to the temporal region of the user when the user wears the display section 20. The right display drive section 22 and the left display drive section 24 (hereinafter collectively referred to as the display drive sections) are arranged on the side facing the head of the user when the user wears the display section 20.

The display drive sections include a right liquid crystal display 241 (hereinafter right LCD 241), a left liquid crystal display 242 (hereinafter left LCD 242), a right projection optical system 251, a left projection optical system 252 and the like. A detailed explanation of the configuration of the display drive sections will be given later.

The right optical image display section 26 and the left optical image display section 28 (hereinafter collectively referred to as the optical image display sections) include a right light guide plate 261 and a left light guide plate 262 (hereinafter collectively referred to as the light guide plates) and also include a light control plate. The light guide plates are formed of a light-transmissive resin material or the like and guide image light outputted from the display drive sections to the eyes of the user.

The light control plate is a thin plate-like optical element and is arranged in such a way as to cover the front side of the display section 20, which is opposite to the side of the eyes of the user. By adjusting the light transmittance of the light control plate, the amount of external light entering the user's eyes can be adjusted and the visibility of the virtual image can thus be adjusted.

The display section 20 also includes a connecting section 40 for connecting the display section 20 to the control section 10. The connecting section 40 includes a main body cord 48, a right cord 42, a left cord 44, and a connecting member 46.

The right cord 42 and the left cord 44 are two branch cords split from the main body cord 48. The display section 20 and the control section 10 execute transmission of various signals via the connecting section 40. For the right cord 42, the left cord 44 and the main body cord 48, metal cables or optical fibers can be employed, for example.

The control section 10 is a device for controlling the HMD 100. The control section 10 has an operation section 135 including an electrostatic track pad, a plurality of buttons which can be pressed, or the like. The operation section 135 is arranged on the surface of the control section 10.

FIG. 2 is a block diagram functionally showing the configuration of the HMD 100. As shown in FIG. 2, the control section 10 has a ROM 121, a RAM 122, a power supply 130, the operation section 135, a CPU 140, an interface 180, a sending section 51 (Tx 51), and a sending section 52 (Tx 52).

The power supply 130 supplies electricity to each part of the HMD 100. Various programs are stored in the ROM 121. The central processing unit (CPU) 140 develops the various programs stored in the ROM 121 into the RAM 122 and thus executes the various programs. The CPU 140 may include one or more processors. The various programs include a program having instructions for realizing the pose derivation processing described later.

The CPU 140 develops programs stored in the ROM 121 into the RAM 122 and thus functions as an operating system 150 (OS 150), a display control section 190, a sound processing section 170, an image processing section 160, and a processing section 167.

The display control section 190 generates a control signal to control the right display drive section 22 and the left display drive section 24. The display control section 190 controls the generation and emission of image light by each of the right display drive section 22 and the left display drive section 24.

The display control section 190 sends control signals for a right LCD control section 211 and a left LCD control section 212 via the sending sections 51 and 52. The display control section 190 also sends control signals for a right backlight control section 201 and a left backlight control section 202.

The image processing section 160 acquires an image signal included in a content and sends the acquired image signal to a receiving section 53 and a receiving section 54 of the display section 20 via the sending section 51 and the sending section 52. The sound processing section 170 acquires an audio signal included in a content, then amplifies the acquired audio signal, and supplies the amplified audio signal to a speaker (not illustrated) in a right earphone 32 connected to the connecting member 46 or to a speaker (not illustrated) in a left earphone 34.

The processing section 167 calculates the pose of a real object by a homography matrix, or by the methods described later, for example. The pose of a real object is the spatial relationship between the camera 60 and the real object. The processing section 167 may calculate a rotation matrix for converting from a coordinate system fixed on the camera to a coordinate system fixed on the IMU 71, using the calculated spatial relationship and the detection value of acceleration or the like detected by the IMU 71. The functions of the processing section 167 are used for the pose derivation processing described later.

The interface 180 is an input/output interface for connecting various external devices OA, which serve as content supply sources, to the control section 10. The external devices OA may include a storage device, personal computer (PC), cellular phone terminal, game terminal and the like storing an AR scenario, for example. As the interface 180, a USB interface, micro USB interface, memory card interface or the like can be used, for example.

The display section 20 has the right display drive section 22, the left display drive section 24, the right light guide plate 261 as the right optical image display section 26, and the left light guide plate 262 as the left optical image display section 28.

The right display drive section 22 includes the receiving section 53 (Rx 53), the right backlight control section 201, a right backlight 221, the right LCD control section 211, the right LCD 241, and the right projection optical system 251. The right backlight control section 201 and the right backlight 221 function as a light source.

The right LCD control section 211 and the right LCD 241 function as a display element. Meanwhile, in other embodiments, the right display drive section 22 may have a self-light-emitting display element such as an organic EL display element, or a scanning display element which scans the retina with a light beam from a laser diode, instead of the above configuration. The same applies to the left display drive section 24.

The receiving section 53 functions as a receiver for serial transmission between the control section 10 and the display section 20. The right backlight control section 201 drives the right backlight 221, based on a control signal inputted thereto. The right backlight 221 is a light-emitting member such as an LED or electroluminescence (EL) element, for example. The right LCD control section 211 drives the right LCD 241, based on control signals sent from the image processing section 160 and the display control section 190. The right LCD 241 is a transmission-type liquid crystal panel in which a plurality of pixels is arranged in the form of a matrix.

The right projection optical system 251 is made up of a collimating lens which turns the image light emitted from the right LCD 241 into a parallel luminous flux. The right light guide plate 261 as the right optical image display section 26 guides the image light outputted from the right projection optical system 251 to the right eye RE of the user while reflecting the image light along a predetermined optical path. The left display drive section 24 has a configuration similar to that of the right display drive section 22 and corresponds to the left eye LE of the user, and therefore will not be described further in detail.

Calibration using the IMU 71 and the camera 60 varies in accuracy, depending on the capability of the IMU 71 as an inertial sensor. If an inexpensive IMU with lower accuracy is used, significant errors and drifts may occur in the calibration.

In the embodiment, calibration is executed based on a batch solution-based algorithm using a multi-position method with the IMU 71. In the embodiment, design data obtained in manufacturing is used for the translational relationship between the IMU 71 and the camera 60.

Calibration is executed separately for the IMU 71 and for the camera 60 (hereinafter referred to as independent calibration). As a specific method of independent calibration, a known technique is used.

In the independent calibration, the IMU 71 is calibrated. Specifically, with respect to a 3-axis acceleration sensor (Ax, Ay, Az), a 3-axis gyro sensor (Gx, Gy, Gz), and a 3-axis geomagnetic sensor (Mx, My, Mz) included in the IMU 71, the gain/scale, static bias/offset, and skew among the three axes are calibrated.

As these calibrations are executed, the IMU 71 outputs acceleration, angular velocity, and geomagnetism as output values of the respective sensors for acceleration, angular velocity, and geomagnetism. These output values are obtained as the result of correcting the gain, static bias/offset, and misalignment among the three axes. In the embodiment, these calibrations are carried out at a manufacturing plant or the like at the time of manufacturing the HMD 100.

In the calibration on the camera 60 executed in the independent calibration, internal parameters of the camera 60 including focal length, skew, principal point, and distortion are calibrated. A known technique can be employed for the calibration on the camera 60.

After the calibration on each sensor included in the IMU 71 is executed, the detection values (measured outputs) from the respective sensors for acceleration, angular velocity, and geomagnetism in the IMU 71 are combined. Thus, an IMU orientation with high accuracy can be realized.

In the embodiment, as described later, the pose of a real object is improved. An outline of the pose improvement will now be described. The pose improvement is important in real object detection and pose estimation (OD/PE) and can be utilized in various applications such as augmented reality, robots, or self-driving cars.

The method in the embodiment includes an appearance-based method called a model alignment method (MA) and a method called a contour feature element method (CF). The appearance-based method is a method in which the color of a pixel in the foreground and the color of a pixel in the background are optimized. The contour feature element method is an edge-based method in which the correspondence between a 3D model and 2D image points is established using an outer contour line of a real object.

The MA method and the CF method are based solely on intensity image data. In the embodiment, a 3D surface-based method using depth image data is used as well. The methods in the embodiment are based on the iterative closest point (ICP) algorithm. With the iterative closest point algorithm, the correspondence between points is established by using the shortest Euclidean distance within a predetermined neighbor discovery (or search) size as a reference.

Since some initial poses deviate largely from the true pose, a neighbor discovery size is selected adaptively, based on depth verification scores. In the embodiment, this algorithm is called the adapted iterative closest point (a-ICP) method.

A scenario which has a high degree of difficulty (that is, is challenging) for the OD/PE and the pose improvement is an untidy environment (complicated or cluttered background). A precondition in this case is that most of the real object is still visible (that is, the occlusion is slight).

The performance of the MA method generally drops in an untidy scenario because the foreground and the background can no longer be discriminated from each other even if the appearance is used. Thus, the embodiment focuses on a pose improvement algorithm using the CF method and the a-ICP method.

FIG. 3 is a flowchart showing the pose derivation processing. This flowchart is for deriving the pose of a real object, combining the CF method and the a-ICP method. Therefore, either the acquisition of data by the CF method or the acquisition of data by the a-ICP method may be carried out first. In the description below, the a-ICP method is carried out first. The 3D model in the embodiment is a model prepared using 3D CAD.

First, information of 3D model surface points and 3D image surface-based points is acquired using the a-ICP method (S300).

Here, the a-ICP method will be described. The a-ICP method is based on the ICP method. The ICP method refers to an algorithm used to minimize the difference between two point clouds, as described above. Since the ICP method is known, its outline will only be briefly described.

The 3D model surface points form a point cloud (a set of points) associated with surface feature elements on a 3D model corresponding to the real object. The 3D model is prepared in advance. The 3D model surface points are predetermined. The 3D model surface points are also referred to as a first 3D model point cloud.

The 3D image surface-based points are data acquired from the current depth image data of the depth image sensor 80 and form a 3D surface point cloud. That is, the 3D image surface-based points are data representing the distance to each of the surface feature elements of the real object.

The ICP method decides the pose of the 3D model in such a way that the difference in position between the 3D model surface points and the 3D image surface-based points is minimized. However, in the embodiment, since the pose is improved in S500, the pose of the 3D model is not decided in S300.

a-ICP is the abbreviation of adapted iterative closest point. That is, the a-ICP method refers to an adapted ICP method. The term "adapted" means that the pose is roughly aligned if the current pose is not similar to the final pose, whereas the pose is finely aligned if the current pose is similar to the final pose.

Specifically, using two different ICP parameters, either rough alignment or fine alignment is achieved. The two parameters are as follows.

The first parameter is a parameter representing how finely the point cloud is sampled. At rough levels, an overall combination is emphasized. That is, a combination using an overall shape based on a roughly sampled point cloud is emphasized. Meanwhile, at fine levels, the combination of individual points is emphasized.

The second parameter is a parameter representing the size of the neighbor discovery range for establishing the correspondence.

FIG. 4 shows neighbor discovery ranges. In FIG. 4, a neighbor discovery range SW1 at a finer level and a neighbor discovery range SW2 at a rougher level are shown. The neighbor discovery ranges SW1, SW2 define ranges to search in a CAD point cloud CPC, corresponding to points (X_i, Y_i, Z_i) included in a scene point cloud SPC.

In FIG. 4, the scene point cloud SPC is expressed in a structured (mesh-like) two-dimensional arrangement. The CAD point cloud CPC is expressed in a structured two-dimensional arrangement formed by re-projection using the current view of the real object.

The neighbor discovery range SW2 at the rougher level enables finding the correspondence between points which are distant from each other. Consequently, it is possible for the point cloud to move over a long distance.

Meanwhile, the neighbor discovery range SW1 at a finer level places a limitation so as to prevent the pose estimation by the CPU 140 from diverging from the true pose.
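As an illustration only, not part of the embodiment, the windowed correspondence search over these structured two-dimensional arrangements could be sketched as follows in Python with NumPy. The H x W x 3 array layout, the function name, and the invalid-pixel handling are assumptions made for the sketch:

```python
import numpy as np

def windowed_match(scene, cad, radius):
    """For each valid scene point, find the closest CAD point within a
    (2*radius+1) x (2*radius+1) window centered at the same grid
    position. A large radius corresponds to the rougher range SW2,
    a small one to the finer range SW1."""
    h, w, _ = scene.shape
    pairs = []
    for i in range(h):
        for j in range(w):
            p = scene[i, j]
            if not np.isfinite(p).all():          # skip invalid depth pixels
                continue
            win = cad[max(0, i - radius):i + radius + 1,
                      max(0, j - radius):j + radius + 1].reshape(-1, 3)
            win = win[np.isfinite(win).all(axis=1)]
            if len(win) == 0:
                continue
            q = win[((win - p) ** 2).sum(axis=1).argmin()]
            pairs.append((q, p))                  # (model point, scene point)
    return pairs
```

Switching radius between a large value (SW2) and a small value (SW1) would reproduce the rough and fine levels described above.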

Using the a-ICP method, N_aICP (in the embodiment, significantly more than 100) combinations of 3D model surface points and 3D image surface-based points are acquired. That is, the relationship between the pose of the real object that was acquired most recently and the current (latest) depth image data is acquired with respect to N_aICP surface feature elements. The most recently acquired pose is the pose in a frame before the current frame. The most recently acquired pose is also referred to as a first pose.

As described later, the first pose is improved in S500 and thus turns into a second pose. The second pose is the pose in the current frame. The second pose serves as the first pose in the next S300 and S400.

Subsequently, using the CF method, N_CF (in the embodiment, 100) combinations of 3D model CF points Pm-3d and 3D image CF points Pimg-3d are acquired (S400).

FIG. 5 is a flowchart showing the CF method. First, an image of a real object is captured using the camera 60 (S421). The image acquired in S421 is intensity image data including a plurality of image points on the real object and its background.

Subsequently, edge detection is executed on the captured image of the real object (S423). For the edge detection, feature elements to form an edge are calculated, based on pixels in the captured image. In the embodiment, the gradient vector (also referred to simply as "gradient") of intensity of each pixel in the captured image of the real object is calculated, thereby deciding the feature elements. In the embodiment, in order to detect an edge, the gradient magnitudes are simply compared with a threshold and those that are not local maxima are suppressed (non-maxima suppression), as in the procedures of the Canny edge detection method.
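A minimal sketch of the per-pixel gradient computation that feeds this step is given below. The central-difference operator is an illustrative choice; the embodiment only requires some gradient operator followed by thresholding and non-maxima suppression:

```python
import numpy as np

def intensity_gradient(img):
    """Per-pixel intensity gradient (gx, gy) and its magnitude,
    computed with central differences; a Sobel/Scharr-style filter
    would serve the same purpose."""
    img = img.astype(np.float64)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = (img[:, 2:] - img[:, :-2]) / 2.0
    gy[1:-1, :] = (img[2:, :] - img[:-2, :]) / 2.0
    return gx, gy, np.hypot(gx, gy)
```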

Next, 3D model CF points Pm-3d are acquired (S429). The 3D model CF points Pm-3d form a point cloud associated with contour feature elements on the 3D model in a first pose. The first pose used in the CF method is the same as the first pose used in the a-ICP method. The contour feature elements are predetermined on the 3D model. The 3D model CF points Pm-3d are also referred to as a second 3D model point cloud. The 3D model CF points Pm-3d are represented on a 3D coordinate system (3D model coordinate system) with its origin fixed to the 3D model.

Next, based on the 3D model CF points Pm-3d, 2D model points Pm-2d are acquired (S432). FIG. 6 is a conceptual view showing how S432 to S438 are carried out. S432 is realized by projecting the 3D model CF points Pm-3d onto an image plane IP, based on the first pose. The image plane IP is synonymous with the sensor surface of the camera 60. The image plane IP is a virtual plane. The 2D model points Pm-2d are represented on a 2D coordinate system (image plane coordinate system) with its origin placed on the image plane IP.
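As a hedged sketch of S432, assuming a standard pinhole camera model with an intrinsic matrix K (the function name and argument layout are illustrative, not the embodiment's actual implementation):

```python
import numpy as np

def project_points(pts_3d, R, T, K):
    """Project 3D model CF points Pm-3d (model coordinates) onto the
    image plane IP using the first pose (R, T) and pinhole intrinsics
    K; returns the 2D model points Pm-2d in pixel units."""
    cam = pts_3d @ R.T + T            # model frame -> camera frame
    uvw = cam @ K.T                   # perspective projection
    return uvw[:, :2] / uvw[:, 2:3]   # divide by depth
```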

In circumstances where the first pose of the 3D model cannot be used, 3D model CF points Pm-3d cannot be acquired from the 3D model, and hence the 2D model points Pm-2d cannot be acquired based on the 3D model CF points Pm-3d. Such circumstances can take place in the case of executing initialization or re-initialization. The initialization is the case of detecting the pose of the real object for the first time. The re-initialization is the case of detecting the pose of the real object again after the pose of the real object is detected and then lost.

In such cases, 2D model points Pm-2d are acquired by using a 2D template, instead of S429 and S432. Specifically, the following procedures are taken.

First, from among a plurality of 2D templates that are stored, the 2D template generated from the view that is the closest to the pose of the real object captured in the image is selected. The 2D template corresponds to the real object captured in the image and reflects the position and pose of the real object. The control section 10 stores a plurality of 2D templates in advance.

Here, each 2D template is data prepared based on a 2D model obtained by rendering the 3D model corresponding to the real object onto the image plane IP based on a respective view.

A view includes a three-dimensional rigid body conversion matrix representing rotation and translation with respect to a virtual camera, and a perspective projection conversion matrix including camera parameters. Specifically, each 2D template includes 2D model points Pm-2d corresponding to contour feature elements included in the contour (outer line) of the 2D model, 3D model CF points Pm-3d corresponding to the 2D model points Pm-2d, and the view. In the case of using a 2D template, feature points on the 2D model are acquired as the 2D model points Pm-2d.

After the 2D model points Pm-2d are acquired, the correspondence between image points included in the edge of the image of the real object and the 2D model points Pm-2d is established (S434).

In the embodiment, in order to establish the correspondence, first, similarity scores are calculated using the following equation (1) for all of the image points included in the local vicinity of each of the projected 2D model points.

$$\mathrm{SIM}(p, p') = \frac{\vec{E}_p \cdot \nabla I_{p'}}{\max\limits_{q \in N(p)} \left\lVert \nabla I_q \right\rVert} \qquad (1)$$

In the equation (1), p represents the 2D model point Pm-2d, and p′ represents the image point. The indicator of the similarity score expressed by the equation (1) is based on the coincidence between the gradient of intensity of the 2D model point Pm-2d and the gradient of the image point. More specifically, in the equation (1), as an example, the indicator of the similarity score is based on the inner product of the two vectors. The vector E_p in the equation (1) is the unit length gradient vector of the 2D model point Pm-2d (edge point).

In the embodiment, when finding similarity scores, ∇I, which is the gradient of a test image (input image), is used in order to calculate feature elements of the image point p′. The normalization of the magnitude of the gradient with a local maximum value, expressed by the denominator of the equation (1), ensures that priority is given to an edge with locally high intensity. This normalization prevents collation with an edge that is weak and may result in noise.

In the embodiment, when finding similarity scores, the size N(p) of the vicinity range where the correspondence is searched for can be varied. For example, in continuous iterative calculations, if the average of the positional displacements of the projected 2D model points Pm-2d decreases, N(p) can be reduced. Hereinafter, a specific method for establishing the correspondence using the equation (1) will be described as an example.
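A hedged sketch of equation (1), assuming gradient images gx, gy (as from the earlier gradient sketch), a unit gradient vector Ep for the model point, and a list of (x, y) pixel coordinates for the vicinity N(p); all names are illustrative:

```python
import numpy as np

def similarity_score(Ep, gx, gy, px, py, neighborhood):
    """Equation (1): inner product of the model point's unit gradient
    Ep with the image gradient at p' = (px, py), normalized by the
    largest gradient magnitude over the vicinity N(p)."""
    g = np.array([gx[py, px], gy[py, px]])          # gradient at image point p'
    denom = max(np.hypot(gx[y, x], gy[y, x]) for (x, y) in neighborhood)
    return float(Ep @ g) / denom if denom > 0 else 0.0
```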

FIGS. 7 to 11 show an example of a method for establishing the correspondence between 2D model points Pm-2d and image points, based on similarity scores. In FIG. 7, an image IMG (solid line) of a real object captured by the camera 60, a 2D model MD (chain-dotted line), and contour feature elements CFm as 2D model points Pm-2d are shown. The 2D model MD is a two-dimensional contour line obtained by projecting the 3D model in the first pose onto the image plane IP.

In FIG. 7, a plurality of pixels px arranged in the form of a lattice, and areas formed by three by three pixels with each of the contour feature elements CFm situated at the center (for example, area SA1), are shown.

In FIG. 7, an area SA1 with a contour feature element CF1 situated at its center, an area SA2 with a contour feature element CF2 situated at its center, and an area SA3 with a contour feature element CF3 situated at its center are shown, as described later.

The contour feature element CF1 and the contour feature element CF2 are contour feature elements next to each other. Similarly, the contour feature element CF1 and the contour feature element CF3 are contour feature elements next to each other. In other words, the contour feature elements are arranged in the order of the contour feature element CF2, the contour feature element CF1, and the contour feature element CF3.

Since the image IMG of the real object and the 2D model MD do not coincide with each other, as shown in FIG. 7, the correspondence between image points included in the edges of the image IMG of the real object and 2D model points Pm-2d represented by each of a plurality of contour feature elements CFm is established, using the equation (1).

First, the one contour feature element CF1 of the plurality of contour feature elements CFm is selected, and the area SA1 made up of three by three pixels in which the pixel px corresponding to the position of the contour feature element CF1 is situated at its center is extracted.

Next, the area SA2 and the area SA3, each of which is made up of three by three pixels and in which the contour feature element CF2 and the contour feature element CF3, both next to the contour feature element CF1, are situated at their respective centers, are extracted.

In the embodiment, similarity scores are calculated using the equation (1) for each pixel px forming each of the areas SA1, SA2 and SA3. At this stage, all of the areas SA1, SA2 and SA3 are matrices having the same shape and the same size.

FIG. 8 shows an enlarged view of the area SA2 and the similarity scores calculated for each pixel forming the area SA2. FIG. 9 shows an enlarged view of the area SA1 and the similarity scores calculated for each pixel forming the area SA1. FIG. 10 shows an enlarged view of the area SA3 and the similarity scores calculated for each pixel forming the area SA3.

In the embodiment, similarity scores between the 2D model point as the contour feature element and each of the nine image points in the extracted areas are calculated. For example, in the area SA3 of FIG. 10, the pixels px33 and px36 score 0.8, the pixel px39 scores 0.5, and the other six pixels score 0.

The difference in score, that is, the pixels px33 and px36 scoring 0.8 and the pixel px39 scoring 0.5, is due to the curving of the image IMG of the real object at the pixel px39, causing the gradient to differ. As described above, similarity scores are calculated by a similar method for each pixel (image point) forming the extracted areas SA1, SA2 and SA3.

Hereinafter, the description focuses on the contour feature element CF1 (FIGS. 9 and 11). Corrected scores for each pixel forming the area SA1 are calculated (FIG. 11). Specifically, for each pixel forming the area SA1, the similarity scores are averaged with weight coefficients, using the pixels situated at the same matrix positions in each of the areas SA2 and SA3.

Such correction of similarity scores is executed not only on the contour feature element CF1 but also on each of the other contour feature elements CF2 and CF3. This has the effect of smoothing the correspondence between the 2D model points and image points.

In the embodiment, corrected scores are calculated using a weight coefficient of 0.5 for the score of each pixel px in the area SA1, a weight coefficient of 0.2 for the score of each pixel px in the area SA2, and a weight coefficient of 0.3 for the score of each pixel px in the area SA3.

For example, as shown in FIG. 11, the corrected score of 0.55 of the pixel px19 is a value obtained by adding the score of 0.8 of the pixel px19 in the area SA1 multiplied by the weight coefficient of 0.5, the score of 0 of the pixel px29 in the area SA2 multiplied by the weight coefficient of 0.2, and the score of 0.5 of the pixel px39 in the area SA3 multiplied by the weight coefficient of 0.3.

The weight coefficients are inversely proportional to the distances between the contour feature element CF1 as the processing target and the other contour feature elements CF2 and CF3.
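A minimal sketch of this smoothing step, using the weights 0.5, 0.2 and 0.3 from the text (the function name and array-based patch representation are assumptions):

```python
import numpy as np

def corrected_scores(sa1, sa2, sa3, w=(0.5, 0.2, 0.3)):
    """Weighted average of the 3x3 similarity-score patches around a
    contour feature element (sa1) and its two neighbors (sa2, sa3);
    the weights fall off with distance from the processed element."""
    return w[0] * sa1 + w[1] * sa2 + w[2] * sa3

# Example from FIG. 11: px19 scores 0.8, 0 and 0.5 in SA1, SA2, SA3,
# giving a corrected score of 0.5*0.8 + 0.2*0 + 0.3*0.5 = 0.55.
```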

In the embodiment, the image point having the highest score, of the corrected scores of the pixels forming the area SA1, is decided as the image point corresponding to the contour feature element CF1.

For example, the highest value of the corrected scores is 0.64, shared by the pixels px13 and px16. When a plurality of pixels have the same corrected score, the pixel px16 with the shortest distance from the contour feature element CF1 is chosen, and the pixel px16 is made to correspond to the contour feature element CF1.

By comparing the edge detected in the image of the real object (a candidate for a part of the contour) and the 2D model points Pm-2d (contour feature elements CF), image points of the real object corresponding to the respective 2D model points Pm-2d are decided. The image points corresponding to the 2D model points Pm-2d included in the contour feature elements are thus called 2D image points Pimg-2d. As another method for searching for the correspondence between 2D model points and image points, the following method may be employed instead of the above method. First, similarity scores or corrected scores are derived for a plurality of image points falling on a line segment which is perpendicular to the contour line of the 2D model and passes through the 2D model point Pm-2d. Then, the image point having the highest similarity/corrected score on the line segment is defined as the 2D image point Pimg-2d corresponding to the 2D model point Pm-2d.

FIGS. 12 and 13 show correspondence errors that can occur if the above method is not employed in the procedures for establishing correspondence. By using the method according to the embodiment to establish the correspondence between 2D model points Pm-2d and image points, the possibility of errors as shown in FIG. 12 or FIG. 13 can be reduced.

FIGS. 12 and 13 show enlarged views of a part of the captured image IMG of the real object, a set PMn of 2D model points Pm-2d, and a plurality of arrows CS.

FIG. 12 shows that one 2D model point Pm-2d can be matched with multiple image points included in one edge. That is, there is a plurality of options such as the arrows CS1 to CS5 to decide which part of the edge detected as the image IMG of the real object the 2D model point Pm-2d corresponds to.

FIG. 13 shows an example in which 2D model points Pm-2d are matched with wrong image points. Specifically, a plurality of 2D model points PM1 to PM5 are wrongly matched with (image points included in) the edge detected as the image IMG of the real object.

In this case, for example, even though the 2D model points are arranged in the order of PM2, PM3, PM1, PM4 and PM5 from the top in FIG. 13, the arrows are arranged in the order of CS7, CS6, CS8, CS10 and CS9 along the edge of the image IMG of the real object. Therefore, the arrows CS8 and CS6, and the arrows CS9 and CS10, are switched.

Back to FIG. 6, imaginary lines Ray-img passing through a camera origin O (the origin of the camera coordinate system) and the respective 2D image points Pimg-2d are calculated (S436). Each imaginary line Ray-img is a straight line defined on the 3D coordinate system.

Finally, 3D image CF points Pimg-3d are acquired (S438). The 3D image CF points Pimg-3d are also referred to as 3D image contour points. The 3D image CF points Pimg-3d are acquired by projection from the corresponding 3D model CF points Pm-3d onto the corresponding imaginary lines Ray-img. Specifically, a 3D image CF point Pimg-3d is the foot of a perpendicular line drawn from the corresponding 3D model CF point Pm-3d to the corresponding imaginary line Ray-img.
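A hedged sketch of S436 and S438, assuming a pinhole intrinsic matrix K and that pm_3d is already expressed in the camera coordinate system (the first pose applied); the names are illustrative:

```python
import numpy as np

def foot_on_ray(pm_3d, pimg_2d, K):
    """3D image CF point Pimg-3d: the foot of the perpendicular drawn
    from the 3D model CF point Pm-3d (camera coordinates) to the
    imaginary line Ray-img through the camera origin O and the 2D
    image point Pimg-2d."""
    # Direction of Ray-img: back-projected image point, unit length.
    d = np.linalg.inv(K) @ np.array([pimg_2d[0], pimg_2d[1], 1.0])
    d /= np.linalg.norm(d)
    return (pm_3d @ d) * d            # orthogonal projection onto the ray
```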

As described above, using the CF method, N_CF combinations of 3D model CF points Pm-3d and 3D image CF points Pimg-3d are acquired.

Next, the update of the pose is calculated (S500). The pose in the current frame is derived in S500. The pose thus derived is called a second pose. The second pose is derived based at least on the 3D model surface points (first 3D model point cloud), the 3D image surface-based points (3D surface point cloud), the 3D model CF points Pm-3d (second 3D model point cloud), and the 3D image CF points Pimg-3d (3D image contour point cloud).

If a 3D point correspondence set (p, p′) made up of N points is given, the pose is optimized by finding R and T that minimize the sum of squares (Σ²) of the distance differences. The sum of squares of the distance differences is calculated by the following equation.

$$\Sigma^2 = \sum_{i=1}^{N} \left\lVert p_i' - (R p_i + T) \right\rVert^2 \qquad (2)$$

R in the equation (2) is a rotation element in a conversion matrix. T in the equation is a translation element in the conversion matrix.

These can easily be linearly coupled with respect to both the CF data and the a-ICP data expressed in the 3D-to-3D domain. However, in the embodiment, the origin of the coordinate system of the camera 60 (the 3D coordinate system of the RGB image sensor) and the origin of the distance camera coordinate system (the 3D coordinate system of the depth image sensor 80) are different from each other. Therefore, in the embodiment, each correspondence set is converted to a common coordinate system (for example, the 3D coordinate system of a robot or the 3D coordinate system of the display section 20 of the HMD 100). The minimization function after this conversion is simply the linear sum of the error terms.

$$\Sigma^2 = \sum_{i=1}^{N_{aICP}} \left\lVert D p_i' - (R\,D p_i + T) \right\rVert^2 + \sum_{j=1}^{N_{CF}} \left\lVert C p_j' - (R\,C p_j + T) \right\rVert^2 \qquad (3)$$

D in the equation is a conversion matrix and represents the basis change from the distance camera coordinate system to the common coordinate system. C in the equation is a conversion matrix and represents the basis change from the color camera coordinate system to the common coordinate system.

R and T in the equation (3) have closed-form (analytical) solutions. Therefore, in the search for the minimum value of the function, a nonlinear least squares method such as the Gauss-Newton method is not necessary.
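The text does not name the particular closed-form solution used; one standard choice is the SVD-based Kabsch/Horn alignment, sketched below under that assumption. D and C are taken as 4x4 homogeneous basis-change matrices, and each correspondence is a (p, p′) pair of 3D points:

```python
import numpy as np

def solve_pose(aicp_pairs, cf_pairs, D, C):
    """Minimize equation (3): convert each correspondence set to the
    common coordinate system with D and C, stack them, and solve for
    (R, T) in closed form (Kabsch/Horn SVD; no Gauss-Newton needed)."""
    def to_common(pairs, M):
        pts = np.array([[*p, *q] for p, q in pairs])           # N x 6
        h = lambda x: (M @ np.c_[x, np.ones(len(x))].T).T[:, :3]
        return h(pts[:, :3]), h(pts[:, 3:])
    p1, q1 = to_common(aicp_pairs, D)   # a-ICP (model, scene) pairs
    p2, q2 = to_common(cf_pairs, C)     # CF (model, image) pairs
    p, q = np.vstack([p1, p2]), np.vstack([q1, q2])

    mu_p, mu_q = p.mean(axis=0), q.mean(axis=0)
    H = (p - mu_p).T @ (q - mu_q)       # cross-covariance of centered sets
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:            # guard against a reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, mu_q - R @ mu_p           # T = mu_q - R mu_p
```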

After S500, whether to end the improvement of the pose or not is determined (S510). That is, whether to carry out S500 repeatedly or not is determined. If the improvement of the pose is not to end (S510, NO), S300 to S500 are executed again. Thus, the derivation of a conversion matrix (R and T) corresponding to each acquired image frame is continued, and consequently the pose of the real object can be tracked.

If the improvement of the pose is to end (S510, YES), the final pose is returned (S520). That is, the conversion matrix (R and T) calculated in the most recent S500 is outputted.

According to the processing described above, the disadvantages observed in the case where each of the pose improvement method based on the CF method and the pose improvement method based on the a-ICP method is used independently can be compensated for. The advantages and disadvantages of the CF method and the a-ICP method will now be described.

An advantage of the CF method is that high accuracy is secured in a clean (isolated) state. The clean state refers to the state where the contour can be clearly distinguished from the background.

A disadvantage of the CF method is that accuracy may be low in an untidy state, particularly with respect to dark real objects whose outer edges are confused with each other. Also, the method is not always robust to the scaling of real objects, but this can be improved by using a stereo camera or multiple cameras.

An advantage of the a-ICP method is that high accuracy is secured both in the clean state and in the untidy state.

A disadvantage of the a-ICP method is that accuracy may be low with respect to real objects having very ordinary surfaces (surfaces with no particular features) such as a flat surface or a cylinder. This is because the correspondence between neighboring points has high ambiguity.

As described above, the disadvantage of the CF method and the disadvantage of the a-ICP method can be regarded as independent of each other. Therefore, according to the embodiment, the pose can be accurately derived by compensating for the disadvantages of the two methods.

This disclosure is not limited to the embodiments, examples and modifications described in the specification and can be realized with various other configurations without departing from the scope of the disclosure. For example, technical features in the embodiments, examples and modifications corresponding to technical features in each configuration described in the summary section can be adaptively replaced or combined in order to solve a part or the entirety of the foregoing problems or in order to achieve a part or the entirety of the foregoing advantageous effects. Such technical features can be adaptively deleted unless described as essential in the specification. For example, the following examples can be employed.

The first pose need not be a pose in a frame preceding the current frame. For example, the pose of the real object acquired from the camera 60 (image sensor) may be used as the first pose. In the case of acquiring the pose of the real object from the camera 60, the a-ICP method may be used, and the ICP method may be used as well.

Alternatively, the pose of the real object acquired from the depth image sensor 80 may be used as the first pose. In the case of acquiring the pose of the real object from the depth image sensor 80, the CF method may be used.

As described above, in the case where the first pose is derived based on the camera 60 or another image sensor (depth image sensor 80), the processing load is reduced.

The ratio of the number of CF points to the number of a-ICP points may be adaptively set. The number of a-ICP points can vary depending on the adaptation level. However, in any case, the number of a-ICP points is much greater than the number of CF points. The sampling of a-ICP points can be changed in such a way as to be a function of the local geometry (geometric structure). For example, a flat area conveys little descriptive information and therefore does not need dense sampling.

A reliability element may be added to the correspondence of 3D points. The reliability element is a coefficient representing reliability. This can be done by introducing an N×N diagonal matrix, where each diagonal element is the reliability element of each point. The reliability element can be calculated based on the magnitude of the gradient vector of the CF point, for example. Alternatively, the reliability element can be calculated based on the surface ambiguity of the a-ICP point.
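As a hedged sketch of this modification, a weighted version of the closed-form fit can take the diagonal of the N×N reliability matrix as a per-correspondence weight vector (the function name and weighting scheme are assumptions, not stated in the text):

```python
import numpy as np

def weighted_rigid_fit(p, q, w):
    """Closed-form (R, T) minimizing sum w_i ||q_i - (R p_i + T)||^2,
    where w holds the diagonal reliability elements: weighted
    centroids and a weighted cross-covariance replace the plain ones."""
    w = w / w.sum()
    mu_p, mu_q = w @ p, w @ q
    H = (p - mu_p).T @ ((q - mu_q) * w[:, None])
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # guard against a reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, mu_q - R @ mu_p
```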

The number of adaptation levels in the a-ICP method may be adaptivelychanged.

The device which executes the pose derivation processing may be any device having a computing function. For example, a video see-through HMD may be employed, and devices other than the HMD may be employed as well. The devices other than the HMD may include a robot, a portable display device (for example, a smartphone), a head-up display (HUD), or a stationary display device.

In the above description, a part or the entirety of the functions and processing realized by software may be realized by hardware. Meanwhile, a part or the entirety of the functions and processing realized by hardware may be realized by software. As the hardware, various circuits may be used, such as an integrated circuit, a discrete circuit, or a circuit module made up of a combination of these.

The entire disclosure of Japanese Patent Application No. 2016-227595, filed on Nov. 24, 2016, is incorporated by reference herein.

What is claimed is:
1. A non-transitory storage medium containing program instructions that, when executed by a processor, cause the processor to perform a method comprising: obtaining a first 3D model point cloud associated with surface feature elements of a 3D model corresponding to a real object in a scene, the first 3D model point cloud being on the 3D model; obtaining a 3D surface point cloud from current depth image data of the real object captured with a depth image sensor; obtaining a second 3D model point cloud associated with 2D model points in a model contour that is obtained from projection of the 3D model onto an image plane using a first pose of the 3D model, the second 3D model point cloud being on the 3D model; obtaining a 3D image contour point cloud at respective intersections of first imaginary lines and second imaginary lines, the first imaginary lines passing through image points and the origin of a 3D coordinate system of an image sensor, the image points being obtained from current intensity image data of the real object captured with the image sensor and corresponding to the 2D model points included in the model contour, the second imaginary lines passing through the second 3D model point cloud and being perpendicular to the first imaginary lines; and deriving a second pose based at least on the first 3D model point cloud, the 3D surface point cloud, the second 3D model point cloud, the 3D image contour point cloud and the first pose.

2. The non-transitory storage medium according to claim 1, wherein the first pose is a pose of the real object in a frame before a current frame of the depth image data or the intensity image data, and the second pose is a pose of the real object in the current frame of the depth image data or the intensity image data.
3. The non-transitory storage medium according to claim 1, wherein the first pose is a pose obtained from the image sensor or another image sensor.
4. A method for deriving a pose of a real object in a scene comprising steps of: obtaining a first 3D model point cloud associated with surface feature elements of a 3D model corresponding to a real object in a scene, the first 3D model point cloud being on the 3D model; obtaining a 3D surface point cloud from current depth image data of the real object captured with a depth image sensor; obtaining a second 3D model point cloud associated with 2D model points in a model contour that is obtained from projection of the 3D model onto an image plane using a first pose of the 3D model, the second 3D model point cloud being on the 3D model; obtaining a 3D image contour point cloud at respective intersections of first imaginary lines and second imaginary lines, the first imaginary lines passing through image points and the origin of a 3D coordinate system of an image sensor, the image points being obtained from current intensity image data of the real object captured with the image sensor and corresponding to the 2D model points included in the model contour, the second imaginary lines passing through the second 3D model point cloud and being perpendicular to the first imaginary lines; and deriving a second pose based at least on the first 3D model point cloud, the 3D surface point cloud, the second 3D model point cloud, the 3D image contour point cloud and the first pose.
5. A pose derivation device comprising: a function of obtaining a first 3D model point cloud associated with surface feature elements of a 3D model corresponding to a real object in a scene, the first 3D model point cloud being on the 3D model; a function of obtaining a 3D surface point cloud from current depth image data of the real object captured with a depth image sensor; a function of obtaining a second 3D model point cloud associated with 2D model points in a model contour that is obtained from projection of the 3D model onto an image plane using a first pose of the 3D model, the second 3D model point cloud being on the 3D model; a function of obtaining a 3D image contour point cloud at respective intersections of first imaginary lines and second imaginary lines, the first imaginary lines passing through image points and the origin of a 3D coordinate system of an image sensor, the image points being obtained from current intensity image data of the real object captured with the image sensor and corresponding to the 2D model points included in the model contour, the second imaginary lines passing through the second 3D model point cloud and being perpendicular to the first imaginary lines; and a function of deriving a second pose based at least on the first 3D model point cloud, the 3D surface point cloud, the second 3D model point cloud, the 3D image contour point cloud and the first pose.